Comparing Duke, UNC, and NC State Twitter Activity: A Text Mining Approach

Soraya Campbell

2021-05-08

1 Twitter activity of Duke, UNC, and NC State

Duke University, NC State University, and UNC-Chapel Hill are three Research I universities in central North Carolina. Together they make up the ‘Research Triangle,’ which, along with Research Triangle Park and other area universities, drives innovation and growth in the region. While all three share the Research I designation, they are distinct in many respects. Can we discern these distinctions from their Twitter activity? What topics do they engage with the most? Is there any overlap?

This analysis is intended for anyone interested in text mining techniques, especially for social media data like Twitter; those working for institutions of higher education, especially communications professionals or social media managers; or anyone who is a Blue Devil 😈, Tar Heel 🐏, or part of the Wolfpack 🐺.

In this walk-through, we’ll explore:

  • When and how often does each of these accounts tweet?
  • What do these schools tweet about the most? Is there a difference between schools?
  • Which words are more likely to be retweeted or favorited between these accounts?
  • Do certain words share a similar sentiment?

Let’s find out!

2 Getting the Data

Using the rtweet package, I pulled the latest 3200 tweets from each university’s main Twitter account.

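## load packages (assumed: rtweet for the API, tidyverse for wrangling and plots)
library(rtweet)
library(tidyverse)
## get timelines
tmls <- get_timelines(c("dukeu", "ncstate", "unc"), n = 3200)
glimpse(tmls)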

I then filtered to keep only tweets from a one-year range (April 25, 2020 onward), which resulted in 7522 observations and the following activity by account:

## keep only tweets from the one-year window starting 2020-04-25
tmlsyr <- tmls %>%
    filter(created_at >= "2020-04-25")

Table 2.1: Number of tweets between April 2020 and April 2021, by account
screen_name | n
NCState | 3057
UNC | 2435
DukeU | 2030
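
The code behind Table 2.1 isn’t shown, but a count by account is probably all it takes. Here is a minimal sketch on a made-up stand-in for tmlsyr (the toy_tmlsyr name and its rows are hypothetical, not the real data):

```r
library(dplyr)

# Hypothetical stand-in for tmlsyr: one row per tweet (values are made up)
toy_tmlsyr <- tibble(
  screen_name = c("NCState", "NCState", "NCState", "UNC", "UNC", "DukeU"),
  created_at  = as.POSIXct(c("2020-05-01", "2020-09-16", "2021-03-24",
                             "2020-05-10", "2020-12-12", "2020-06-01"),
                           tz = "UTC")
)

# One row per account, busiest account first
tweet_counts <- toy_tmlsyr %>%
  count(screen_name, sort = TRUE)

tweet_counts
```

On the real data, swapping toy_tmlsyr for tmlsyr should give a table shaped like the one above.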

3 Timeseries

NC State is the leader in number of tweets for this past year. Let’s look at the activity distribution with a ts_plot and ggplot histogram.

## ts_plot by screen name
ts_plot(group_by(tmlsyr, screen_name), "months")

Figure 3.1: Time series plot of Twitter Activity

## histogram with facet wrap
ggplot(tmlsyr, aes(x = created_at, fill = screen_name)) +
    geom_histogram(position = "identity", bins = 20, show.legend = FALSE) +
    facet_wrap(~screen_name, ncol = 1)

Figure 3.2: Histogram of Twitter activity

We can see that each university account has a steady flow of tweets, which averages out to the following number per month over this one-year period:

Table 3.1: Average number of tweets per month by account
screen_name | avg
NCState | 235
UNC | 187
DukeU | 156
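
The code for Table 3.1 isn’t shown either. One way to get a per-month average, sketched here on made-up data (toy_tweets and its values are hypothetical), is to floor each timestamp to its month, count tweets per account and month, and then average those counts:

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for tmlsyr (values are made up)
toy_tweets <- tibble(
  screen_name = c(rep("NCState", 6), rep("UNC", 2)),
  created_at  = as.POSIXct(c("2020-05-03", "2020-05-20",
                             "2020-09-16", "2020-09-16",
                             "2020-09-17", "2020-09-30",
                             "2020-05-10", "2020-06-12"),
                           tz = "UTC")
)

monthly_avg <- toy_tweets %>%
  mutate(month = floor_date(created_at, unit = "month")) %>%  # e.g. 2020-09-01
  count(screen_name, month) %>%                                # tweets per account-month
  group_by(screen_name) %>%
  summarise(avg = mean(n))                                     # mean of the monthly counts

monthly_avg
```

Months with no tweets never appear in the counts, so they don’t pull the average down. The published averages are also consistent with dividing each account’s total by the 13 calendar months the window touches (for example, 3057 / 13 ≈ 235), though that is an inference, not something the code shown here confirms.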

Seems like NC State has been a busy bee, with significant activity in the months of September and March that skews its average upward. What’s going on here?

By analyzing the word counts per month for NC State, we can see that the term #GivingPack dominated in these two months. This makes sense: #GivingPack is a fundraising campaign conducted heavily via social media, and some donations were matched depending on Twitter activity.
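
The monthly word counts themselves aren’t shown. As a rough sketch of the idea, restricted to hashtags for simplicity, the code below finds the most used hashtag per month in a made-up set of tweets (toy_ncstate and its rows are hypothetical, and the regular expression is my own simplification, not the tokenizer used later in this walk-through):

```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

# Hypothetical stand-in for NC State's tweets (rows are made up)
toy_ncstate <- tibble(
  created_at = as.POSIXct(c("2020-09-16 12:00:00", "2020-09-16 13:00:00",
                            "2020-09-20 09:00:00", "2021-03-24 09:00:00",
                            "2021-03-24 10:00:00"), tz = "UTC"),
  text = c("Today is #GivingPack!", "Thank you for #GivingPack gifts",
           "Go #Wolfpack", "#GivingPack is here",
           "Join us for #GivingPack #ThinkAndDo")
)

top_hashtags <- toy_ncstate %>%
  mutate(month   = floor_date(created_at, unit = "month"),
         hashtag = str_extract_all(str_to_lower(text), "#[a-z0-9_]+")) %>%
  unnest(hashtag) %>%                               # one row per hashtag use
  count(month, hashtag) %>%
  group_by(month) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%        # most used hashtag each month
  ungroup()

top_hashtags
```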

Table 3.2: Sample NC State March Tweets
text | created_at
@ecauley92 By #GivingPack today, you’re helping us keep NC State accessible to students from every economic background. Thank you for your gift to @NCStateCED and the NC State Extraordinary Opportunity Scholarship! https://t.co/9hd8azNGKG | 2021-03-24 18:52:04
.@PackWomensBball head coach @WolfpackWes invites you to join us for NC State Day of Giving! Let’s show once again that there is strength in the Pack. 📆 Wednesday, March 24 📍 https://t.co/6fJa9tcy0V #️⃣ #GivingPack https://t.co/qCJNjXGLfk | 2021-03-21 19:19:00

Table 3.3: Sample NC State September Tweets
text | created_at
Our first pet-tastic winner is @NCStateEngr! Thanks to Willis_Doodle on Instagram for repping the Wolfpack on this day of #GivingPack. 🐺🐾 Congrats on that extra $3,000! https://t.co/pKOStG733d | 2020-09-16 19:20:56
Today is #GivingPack, a day to support students impacted by #COVID. Throughout the day, we’re sharing the impact of every dollar raised to #SupportSurvivors &amp; #WOC through the #SurvivorFund &amp; Dr. Frances Graham WOC Leadership Fund. Help us if you can: https://t.co/dhLpvX6VbN https://t.co/6eJNzsch3T | 2020-09-16 20:03:08

We’ll take another look at change over time when we get to the section Comparing Word Frequencies.

4 Comparing Word Frequencies

Now let’s compare which words were used most frequently by each university. In particular, I am interested in seeing how word frequencies change over the one-year span we are looking at. For this I will adapt the Twitter case study and code from Julia Silge and David Robinson’s Text Mining with R.

With Twitter data, some clean-up is usually necessary, especially before tokenizing text. Whether you remove hashtags, @-mentions, emojis, or other special characters depends on your purpose. Here, we’ll use the specialized “tweets” tokenizer, which retains hashtags and @-mentions of usernames. The code below also drops retweets, so that the data contains only tweets these accounts wrote themselves, and strips a few HTML entities (&amp;, &lt;, &gt;).

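library(tidytext)  # unnest_tokens() and stop_words

## HTML entities left over in the raw tweet text
remove_reg <- "&amp;|&lt;|&gt;"
tidy_tweets <- tmlsyr %>%
    filter(!str_detect(text, "^RT")) %>%              # drop retweets
    mutate(text = str_remove_all(text, remove_reg)) %>%
    unnest_tokens(word, text, token = "tweets") %>%   # keeps #hashtags and @mentions
    ## drop stop words (with and without apostrophes); keep tokens containing a letter
    filter(!word %in% stop_words$word,
           !word %in% str_remove_all(stop_words$word, "'"),
           str_detect(word, "[a-z]"))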

Now we can calculate word frequencies for each university’s account.

frequency <- tidy_tweets %>%
    group_by(screen_name) %>%
    count(word, sort = TRUE) %>%
    ## attach each account's total token count, then convert counts to proportions
    left_join(tidy_tweets %>%
        group_by(screen_name) %>%
        summarise(total = n()), by = "screen_name") %>%
    mutate(freq = n/total)

Then the data is pivoted wider to get it ready to plot.

frequency <- frequency %>%
    ungroup() %>%
    select(screen_name, word, freq) %>%
    pivot_wider(names_from = screen_name, values_from = freq) %>%
    arrange(NCState, UNC, DukeU)

These plots compare word usage between pairs of accounts: Duke University, NC State, and UNC-Chapel Hill. Words near the line are used with about equal frequency by both universities, whereas words farther from the line are used much more by one university than the other.
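
The plotting code isn’t shown. A minimal sketch in the style of the Text Mining with R case study looks something like the following; toy_frequency and its values are made up, and the styling choices (alpha, jitter width, label offset) are assumptions rather than the settings used for the figures here:

```r
library(ggplot2)

# Hypothetical stand-in for the wide `frequency` table (values are made up)
toy_frequency <- data.frame(
  word    = c("students", "covid19", "community", "#givingpack", "#uncgrad"),
  NCState = c(0.012, 0.004, 0.003, 0.010, NA),
  UNC     = c(0.011, 0.005, 0.004, NA, 0.006)
)

# Log-log scatter of one account's word frequencies against another's;
# the red y = x line marks words used equally often by both accounts
p <- ggplot(toy_frequency, aes(NCState, UNC)) +
  geom_jitter(alpha = 0.3, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  geom_abline(color = "red")
```

Words that only one account used have NA for the other, so ggplot2 drops them (with a warning) when the plot is drawn. Swapping the column pair in aes() gives the other two comparisons.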

Figure 4.1: Word Frequency between NCState and UNC Twitter accounts

Figure 4.2: Word Frequency between UNC and DukeU Twitter accounts

Figure 4.3: Word Frequency between NCState and DukeU Twitter accounts

You can see that these universities share some common words:

  • student/students
  • covid19
  • community
  • care/support

These words make sense since this time frame coincides with the COVID-19 pandemic, when universities made many efforts to support students.

Although I won’t be performing a formal topic model analysis of these tweets, I think we can safely say that covid19 was a ‘topic’ shared among these universities throughout this year. We will come back to this particular topic when we do our Sentiment Analysis.

To round out our look at word frequencies, let’s find words that have changed in frequency at a moderately significant level in each account’s tweets. This can help us determine ‘trending’ words.
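
The modelling code isn’t shown, so here is a hedged sketch of the general idea as I understand it from the Text Mining with R case study: for each word, fit a binomial regression of its monthly count (out of all words tweeted that month) against time, and read the slope and p-value. The data frame below is made up, and in a real run you would fit one model per word and account and adjust the p-values for multiple comparisons (for example with p.adjust()).

```r
# Hypothetical monthly counts for one word in one account (all numbers made up)
toy_word_by_month <- data.frame(
  month_index = 1:6,                               # months since the window start
  count       = c(2, 4, 5, 9, 12, 15),             # uses of the word that month
  time_total  = c(400, 410, 390, 405, 395, 400)    # all words tweeted that month
)

# Successes = uses of the word; failures = every other word that month
fit <- glm(cbind(count, time_total - count) ~ month_index,
           data = toy_word_by_month, family = binomial)

# A positive slope means the word's share of tweets grows over time
slope   <- coef(summary(fit))["month_index", "Estimate"]
p_value <- coef(summary(fit))["month_index", "Pr(>|z|)"]
```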

Figure 4.4: Trending words in DukeU’s tweets

Notable here is the slope of #covid19, which was used very frequently at the beginning of the pandemic, trended down, and then saw a moderate rise in December. The term ‘vaccine’ also rose in prominence in December, when news of COVID-19 vaccines becoming available started to emerge.

Figure 4.5: Trending words in NCState’s tweets

There’s a lot going on in this graph, but #givingpack features prominently: it starts high in September, falls off precipitously, and picks back up in March, in line with the fundraising campaigns we found in those two months.

Figure 4.6: Trending words in UNC’s tweets

Again, there’s a lot going on, but one trend that makes sense is the hashtag #uncgrad, which starts off high in May 2020, coinciding with spring graduation, and then falls off until December (fall graduation). Following this pattern, it would presumably pick back up in May 2021. The rise of the term ‘fall’ in August coincides with the beginning of the fall term, and ‘testing’ could reflect changes in COVID-19 testing protocols.

5 Comparing Word Usage

Now let’s see what words were being tweeted about the most. I’ll explore this in two ways:

  1. Simple word counts to analyze usage
  2. Term frequency/inverse document frequency (tf-idf) values to analyze which words are more likely to come from one account versus the other
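
To make the second item concrete, here is a small sketch that computes tf-idf by hand with dplyr, treating each account as one document. The counts are made up and toy_counts is a hypothetical name; tidytext’s bind_tf_idf(word, screen_name, n) does the same calculation in one call.

```r
library(dplyr)

# Hypothetical word counts per account (numbers are made up)
toy_counts <- tibble(
  screen_name = c("DukeU", "DukeU", "NCState", "NCState", "UNC", "UNC"),
  word        = c("students", "#duke", "students", "#givingpack",
                  "students", "#uncgrad"),
  n           = c(30, 20, 25, 25, 40, 10)
)

n_accounts <- n_distinct(toy_counts$screen_name)

toy_tf_idf <- toy_counts %>%
  group_by(screen_name) %>%
  mutate(tf = n / sum(n)) %>%                                   # share of the account's words
  group_by(word) %>%
  mutate(idf = log(n_accounts / n_distinct(screen_name))) %>%   # 0 if every account uses it
  ungroup() %>%
  mutate(tf_idf = tf * idf) %>%
  arrange(desc(tf_idf))

toy_tf_idf
```

A word like ‘students’ that every account uses gets an idf of zero, so it drops out, while account-specific hashtags rise to the top.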

First, let’s do a simple word count of the top twenty words that are in the corpus.

Figure 5.1: Word Counts for Duke, UNC, and NC State
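
The counting step behind a chart like this is short. Here is a sketch on made-up tokens (toy_tidy is hypothetical, and I keep the top two words rather than twenty so the toy output stays readable):

```r
library(dplyr)

# Hypothetical tokenized tweets, one word per row (values are made up)
toy_tidy <- tibble(
  word = c("students", "students", "students", "covid19", "covid19", "research")
)

top_words <- toy_tidy %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 2, with_ties = FALSE)   # use n = 20 on the real data

top_words
```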

Next, let’s use wordcloud to help us visualize these terms per university.

Here’s Duke, whose name features prominently in the visualization along with ‘students,’ ‘covid19,’ ‘pandemic,’ ‘faculty,’ ‘research,’ ‘health,’ among many other terms.

Figure 5.2: Word cloud of Duke University’s most tweeted terms

Now here’s NC State, whose motto ‘think and do’ features prominently along with the word ‘students.’ Many of the Twitter screen names of the university’s individual colleges and departments are displayed here as well. You can also see the words ‘gift,’ ‘support,’ ‘scholarship,’ ‘giving,’ ‘helping,’ ‘donors,’ and of course, ‘wolfpack.’

NC State Wordcloud

And finally, here’s UNC with ‘tar’ and ‘heels’ front and center (Tar Heels is what they call themselves), along with their hashtag ‘#unc’ and the terms ‘students,’ ‘covid19,’ ‘pandemic,’ and ‘community.’